Gradient Coding With Dynamic Clustering for Straggler-Tolerant Distributed Learning
Authors
Abstract
Distributed implementations are crucial in speeding up large-scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers. Coded distributed computation techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. In this paper, we introduce a novel paradigm of dynamic coded computation, which assigns redundant data to workers to acquire the flexibility to dynamically choose from among a set of possible codes depending on the past straggling behavior. In particular, we propose gradient coding (GC) with dynamic clustering, called GC-DC, to regulate the number of stragglers in each cluster by dynamically forming the clusters at each iteration. With time-correlated straggling behavior, GC-DC adapts to the straggling behavior over time; at each iteration, GC-DC aims to distribute the stragglers across clusters as uniformly as possible based on the past straggler behavior. For both homogeneous and heterogeneous worker models, we numerically show that GC-DC provides significant improvements in the average per-iteration completion time without an increase in the communication load compared to the original GC scheme.
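As a rough illustration of the clustering idea in the abstract, the Python sketch below spreads workers that straggled in the previous iteration across clusters as evenly as possible. It is a simplification under stated assumptions, not the paper's algorithm: the names `form_clusters` and `stragglers_per_cluster` are hypothetical, the previous iteration's straggler flags are used as the only prediction, and any worker is assumed free to join any cluster (no data-placement constraint is modeled).

```python
from typing import List, Sequence


def form_clusters(was_straggler: Sequence[bool], num_clusters: int) -> List[List[int]]:
    """Assign worker indices to clusters, spreading past stragglers evenly.

    Assumptions (not from the paper): any worker may join any cluster, and
    the previous iteration's straggler flags predict the next iteration.
    """
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")

    workers = range(len(was_straggler))
    stragglers = [w for w in workers if was_straggler[w]]
    others = [w for w in workers if not was_straggler[w]]

    clusters: List[List[int]] = [[] for _ in range(num_clusters)]
    # Deal past stragglers first, then the remaining workers, round-robin.
    # Straggler counts (and cluster sizes) then differ by at most one.
    for position, worker in enumerate(stragglers + others):
        clusters[position % num_clusters].append(worker)
    return clusters


def stragglers_per_cluster(
    clusters: Sequence[Sequence[int]], was_straggler: Sequence[bool]
) -> List[int]:
    """Count how many past stragglers ended up in each cluster."""
    return [sum(1 for w in cluster if was_straggler[w]) for cluster in clusters]


# Hypothetical example: 12 workers, of which workers 0, 1, 4 and 8 straggled.
flags = [True, True, False, False, True, False,
         False, False, True, False, False, False]
clusters = form_clusters(flags, num_clusters=4)
counts = stragglers_per_cluster(clusters, flags)
```

In this example, a fixed clustering of consecutive workers would put workers 0 and 1 in the same cluster, whereas the round-robin assignment leaves one past straggler in each cluster. Re-running `form_clusters` with updated flags at every iteration mimics, in a very simplified way, the per-iteration re-clustering the abstract describes.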
Similar resources
Near-Optimal Straggler Mitigation for Distributed Gradient Methods
Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes compute partial gradients based on their partial and local data sets, and send the results to a master node where all the computations are aggregated into a f...
Gradient Coding: Avoiding Stragglers in Distributed Learning
We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our schemes in python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and ge...
Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at...
Entropy-based Consensus for Distributed Data Clustering
The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...
Source-Optimized Clustering for Distributed Source Coding
Motivated by the design of low-complexity distributed quantizers and iterative decoding algorithms that leverage the correlation in the data picked up by a large-scale sensor network, we address the problem of finding correlation preserving clusters. To construct a factor graph describing the statistical dependencies between sensor measurements, we develop a hierarchical clustering algorithm th...
Journal
Journal title: IEEE Transactions on Communications
Year: 2023
ISSN: 1558-0857, 0090-6778
DOI: https://doi.org/10.1109/tcomm.2022.3166902